Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts
نویسندگان
چکیده
In this paper we introduce a machine learning approach for automatic text segmentation. Our text segmenter clusters text-segments containing similar concepts. It first discovers the different concepts present in a text, each concept being defined as a set of representative terms. After that the text is partitioned into coherent paragraphs using a clustering technique based on the Classification Maximum Likelihood approach. We evaluate the effectiveness of this technique on sets of concatenated paragraphs from two collections, the 7sectors and the 20 Newsgroups corpus, and compare it to a baseline text segmentation technique proposed by Salton et al.
منابع مشابه
Extraction and 3D Segmentation of Tumors-Based Unsupervised Clustering Techniques in Medical Images
Introduction The diagnosis and separation of cancerous tumors in medical images require accuracy, experience, and time, and it has always posed itself as a major challenge to the radiologists and physicians. Materials and Methods We Received 290 medical images composed of 120 mammographic images, LJPEG format, scanned in gray-scale with 50 microns size, 110 MRI images including of T1-Wighted, T...
متن کاملHigh-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
متن کاملA Thematic Segmentation Procedure for Extracting Semantic Domains from Texts
Thematic analysis is essential for a lot of Natural Language Processing (NLP) applications, such as text summarization or information extraction. It is a two-dimensional process which has both to identify the thematic segments of a text and to recognize the semantic domain concerned by each of them. This second task requires having a representation of these domains. Such representations are bui...
متن کاملLanguage Segmentation
Language segmentation consists in finding the boundaries where one language ends and another language begins in a text wrien in more than one language. is is important for all natural language processing tasks. e problem can be solved by training language models on language data. However, in the case of lowor no-resource languages, this is problematic. I therefore investigate whether unsuper...
متن کاملOn Unsupervised Learning of Mixtures of Markov Sources Thesis submitted for the degree \Master of Science"
Unsupervised classi cation, or clustering, is one of the basic problems in data analysis. While the problem of unsupervised classi cation of independent random variables has been deeply investigated, the problem of unsupervised classi cation of dependent random variables, and in particular the problem of segmentation of mixtures of Markov sources, has been hardly addressed. At the same time sup...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004